Data Harvesting UC3M
2025-03-27
Gather information from the official ATP website (rankings, player statistics…)
Build a scrapper to extract the data (Selenium)
Create a csv file and perform some analysis with it (Shiny)
ATP Ranking
Aces table:
age_country_link <- "https://tennisabstract.com/reports/atpRankings.html" |>
read_html()
age_country <- age_country_link |> html_table()
atp_table <- age_country[[1]]
atp_table
# Let´s clean the table:
atp_table <- atp_table[-1, ]
atp_table <- atp_table[-1, ]
atp_table <- atp_table[, 1:4]
colnames(atp_table) <- c("Rank", "Player", "Country", "Birthdate")
# Keep only the first 300 rows
atp_table <- atp_table |> slice(1:300)
# Calculate age from birthdate
atp_table <- atp_table |>
mutate(Birthdate = as.Date(Birthdate),
Age = floor(interval(Birthdate, today()) / years(1)))
# Keep only Rank, Last_Name, and Age
atp_table <- atp_table |> select(Rank, Player, Age, Country)
atp_table$Rank <- as.numeric(atp_table$Rank)Age + country table:
| ranking | name | %break_points_saved | points_won | total_points | Matches_played.x | %games_won(receive) | games_won_receiving | total_games_receiving | Matches_played.y | %games_won(serve) | games_won_serving | total_games_serving | Matches_played.x.x | Aces_number | Matches_played.y.y | 1st_serve_win_% | Points_won | Total_points | Matches_played | Index | Titles | Wins | Losses |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Sinner | 73.67% | 235 | 319 | 76 | 28.30% | 270 | 954 | 76 | 91.40% | 893 | 977 | 76 | 563 | 76 | 563 | 76 | NA | NA | NA | NA | NA | NA |
| 2 | Zverev | 69.73% | 258 | 370 | 90 | 22.26% | 254 | 1141 | 90 | 90.18% | 1029 | 1141 | 90 | 797 | 90 | 797 | 90 | NA | NA | 0.706 | 0 | 12 | 5 |
| 3 | Alcaraz | 61.85% | 167 | 270 | 64 | 30.79% | 218 | 708 | 64 | 85.47% | 606 | 709 | 64 | 265 | 64 | 265 | 64 | NA | NA | 0.833 | 1 | 15 | 3 |
| 4 | Fritz | 68.97% | 240 | 348 | 74 | 20.00% | 180 | 900 | 74 | 88.32% | 817 | 925 | 74 | 725 | 74 | 725 | 74 | NA | NA | 0.667 | 0 | 10 | 5 |
| 5 | Djokovic | 64.84% | 142 | 219 | 45 | 28.95% | 152 | 525 | 45 | 85.71% | 462 | 539 | 45 | 275 | 45 | 275 | 45 | NA | NA | 0.636 | 0 | 7 | 4 |
| 6 | Ruud | 62.97% | 233 | 370 | 73 | 23.79% | 201 | 845 | 73 | 83.90% | 714 | 851 | 73 | 400 | 73 | 400 | 73 | NA | NA | 0.769 | 0 | 10 | 3 |
| 7 | Draper | 60.68% | 179 | 295 | 59 | 24.83% | 183 | 737 | 59 | 84.15% | 616 | 732 | 59 | 486 | 59 | 486 | 59 | NA | NA | 0.867 | 1 | 13 | 2 |
| 8 | Medvedev | 61.03% | 260 | 426 | 67 | 27.49% | 229 | 833 | 67 | 79.78% | 655 | 821 | 67 | 426 | 67 | 426 | 67 | NA | NA | 0.667 | 0 | 12 | 6 |
| 9 | Rublev | 68.83% | 265 | 385 | 69 | 20.42% | 184 | 901 | 69 | 86.73% | 784 | 904 | 69 | 578 | 69 | 578 | 69 | NA | NA | 0.600 | 1 | 9 | 6 |
| 10 | Tsitsipas | 68.37% | 227 | 332 | 66 | 21.49% | 162 | 754 | 66 | 86.31% | 662 | 767 | 66 | 421 | 66 | 421 | 66 | NA | NA | 0.667 | 1 | 10 | 5 |
# Delete “Matches_played” and “Total_points” columns because they have 100% of NA
df <- df %>% select(-Matches_played, -Total_points)
# Merge column data into one column, taking non-NA values
df$matches_played <- coalesce(df$Matches_played.x, df$Matches_played.y, df$Matches_played.x.x, df$Matches_played.y.y)
# Delete original columns
df <- df %>% select(-Matches_played.x, -Matches_played.y, -Matches_played.x.x, -Matches_played.y.y)
# Delete unwanted columns
df <- df %>% select(-Points_won, -total_points, -points_won)